IMDB Top 250 Movies

Variables

Numeric

  • Num, Year, Released, Runtime, Metascore, imdbRating, imdbVotes, imdbID

Text

  • Title, Genre, Director, Writer, Actor, Plot, Language, Country, Awards, Type, DVD, BoxOffice, Production, Website

DataMunging Pt1

Type Modification

We converted the following text values to numeric values.

  • Runtime -> “123min”
  • BoxOffice -> “$1,234,567”
  • imdbVotes -> “1,234,567”
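
A minimal sketch of this conversion in base R, assuming the raw values look exactly like the three examples above. The helper name `to_numeric` is hypothetical and is not taken from the original analysis.

```r
# Hypothetical helper: drop everything except digits and the decimal point,
# then convert the remaining text to a number.
to_numeric <- function(x) {
  as.numeric(gsub("[^0-9.]", "", x))
}

runtime    <- to_numeric("123min")      # unit suffix removed
box_office <- to_numeric("$1,234,567")  # currency sign and commas removed
imdb_votes <- to_numeric("1,234,567")   # thousands separators removed
```

Because the helper is vectorised, it could be applied to a whole column at once, for example `df$Runtime <- to_numeric(df$Runtime)`, where `df` stands for the movie data frame.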

DataMunging Pt2

Data Split

We split cells that held several entries into separate columns with one entry each for the following variables.

  • Genre -> “Drama, Action”
  • Actor -> “Tim Robbins, Morgan Freeman”
  • Writer -> “Mario Puzo (screenplay)”
  • Language -> “English, Italian, Latin”
  • Country -> “USA, UK”
  • Awards -> “Won 3 Oscars. Another 23 wins & 27 nominations.”
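
The split could be sketched in base R as follows. This is an illustration under assumptions, not the original code: the second genre value and the helper `count_before` are hypothetical, and the source does not state whether Oscar wins were added to the other wins.

```r
# Sample cells based on the examples above; the second genre is hypothetical.
genre  <- c("Drama, Action", "Drama")
awards <- "Won 3 Oscars. Another 23 wins & 27 nominations."

# Split each cell on commas, then pad with NA so every row has the same width.
parts <- strsplit(genre, ",\\s*")
width <- max(lengths(parts))
genre_cols <- do.call(rbind, lapply(parts, function(p) {
  c(p, rep(NA_character_, width - length(p)))
}))
colnames(genre_cols) <- c("First Genre", "Second Genre")[seq_len(width)]

# Hypothetical helper: pull the number that precedes a keyword, e.g. "23 wins".
count_before <- function(text, keyword) {
  pattern <- paste0("[0-9]+(?= ", keyword, ")")
  as.numeric(regmatches(text, regexpr(pattern, text, perl = TRUE)))
}

wins        <- count_before(awards, "win")
nominations <- count_before(awards, "nomination")
```

Padding with NA is what makes the later columns (for example a third genre) only partially complete.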

DataMunging Pt3

Removed

We removed the following columns because they were irrelevant to the analysis.

  • Released
  • Plot
  • imdbID
  • Type
  • DVD
  • Website

DataMunging Pt4

Consistency and Completeness

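We tested each column for completeness. Each value below is the fraction of non-missing entries in that column (1.000 means no missing values). Columns l1 to l7 and c1 to c9 are the split Language and Country columns.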

##          Num        Title         Year      Runtime     Director       Writer 
##        1.000        1.000        1.000        1.000        1.000        0.996 
##    Metascore   imdbRating    imdbVotes    BoxOffice   Production  First Actor 
##        0.708        1.000        1.000        0.300        1.000        1.000 
## Second Actor  Third Actor Fourth Actor       Awards  Nominations  First Genre 
##        1.000        1.000        1.000        1.000        1.000        1.000 
## Second Genre  Third Genre           l1           l2           l3           l4 
##        0.908        0.632        1.000        0.496        0.224        0.088 
##           l5           l6           l7           c1           c2           c3 
##        0.032        0.012        0.004        1.000        0.296        0.088 
##           c4           c5           c6           c7           c8           c9 
##        0.024        0.012        0.004        0.004        0.004        0.004
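
The Contributions section quotes the expression used for this check. A minimal sketch of the same idea, using a hypothetical toy data frame rather than the real data:

```r
# Hypothetical toy data frame; NA marks a missing value.
df <- data.frame(
  Title     = c("A", "B", "C", "D"),
  Metascore = c(80, NA, 94, NA),
  BoxOffice = c(NA, NA, NA, 1234567)
)

# Fraction of non-missing values per column (1 = fully complete).
completeness <- 1 - colSums(is.na(df)) / nrow(df)

# Equivalent, more compact form.
completeness_alt <- colMeans(!is.na(df))
```

Both forms return a named vector with one completeness ratio per column, which matches the layout of the output above.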

Exploratory Data Analysis

1D Analysis

We examined each variable on its own and presented it with an appropriate chart.

Year

Runtime

Directors Pt1

There were 155 unique directors.

There were too many directors to plot, so we plotted only those who appear four or more times.

Directors Pt2

##                     x freq
## 4    Alfred Hitchcock    9
## 9        Billy Wilder    7
## 14    Charles Chaplin    5
## 16  Christopher Nolan    7
## 36        Frank Capra    4
## 86    Martin Scorsese    7
## 108 Quentin Tarantino    4
## 115      Ridley Scott    4
## 134   Stanley Kubrick    8
## 136  Steven Spielberg    7
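
A frequency table like the one above could be produced along these lines. This is a base R sketch with a hypothetical `director` vector; the original output format suggests a different counting function may have been used.

```r
# Hypothetical director column; the real data has one entry per film.
director <- c("Alfred Hitchcock", "Stanley Kubrick", "Alfred Hitchcock",
              "Alfred Hitchcock", "Frank Capra", "Alfred Hitchcock")

# Number of distinct names.
n_unique <- length(unique(director))

# Count how often each name occurs.
freq <- as.data.frame(table(x = director), responseName = "freq",
                      stringsAsFactors = FALSE)

# Keep only names that occur four or more times.
frequent <- freq[freq$freq >= 4, ]
```

The same pattern applies to the writer, production, and actor counts in the following sections.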

Directors Pt3

Writers Pt1

There were 213 unique writers.

There were too many writers to plot, so we plotted only those who appear four or more times.

##                     x freq
## 27    Charles Chaplin    5
## 158 Quentin Tarantino    4
## 185   Stanley Kubrick    6
## 187     Stephen King     4

Writers Pt2

Metascore

imdbRating

imdbVotes

BoxOffice

Production Pt1

There were 89 unique production companies.

There were too many production companies to plot, so we plotted only those that appear four or more times.

Production Pt2

##                                           x freq
## 1                          20th Century Fox   15
## 9                      Buena Vista Pictures    4
## 11                        Columbia Pictures   10
## 41                                      MGM    9
## 45                            Miramax Films    7
## 46                          New Line Cinema    6
## 55                       Paramount Pictures   17
## 62                            Sony Pictures    5
## 72 Twentieth Century Fox Home Entertainment    4
## 73                           United Artists   15
## 78                       Universal Pictures   14
## 81                     Walt Disney Pictures    8
## 82                             Warner Bros.   10
## 83                    Warner Bros. Pictures   27

Production Pt3

Actors and Actresses Pt1

There were 773 unique actors and actresses.

There were too many actors and actresses to plot, so we plotted only those who appear four or more times.

##                   x freq
## 10        Al Pacino    4
## 97    Carrie Fisher    4
## 101      Cary Grant    5
## 106 Charles Chaplin    4
## 120  Christian Bale    4
## 281   Harrison Ford    7

Actors and Actresses Pt2

Awards Pt1

Awards Pt2

There were 28 movies with 75 awards or more.

##                                                Title Awards imdbRating
## 1:                                   The Dark Knight    153        9.0
## 2:                                  Schindler's List     78        8.9
## 3:     The Lord of the Rings: The Return of the King    208        8.9
## 4: The Lord of the Rings: The Fellowship of the Ring    117        8.8
## 5:                                         Inception    154        8.8
## 6:             The Lord of the Rings: The Two Towers    120        8.7

Nominations Pt1

Nominations Pt2

There were 25 movies with 120 nominations or more.

##                                                Title Nominations imdbRating
## 1:                                   The Dark Knight         153        9.0
## 2:     The Lord of the Rings: The Return of the King         122        8.9
## 3: The Lord of the Rings: The Fellowship of the Ring         124        8.8
## 4:                                         Inception         203        8.8
## 5:             The Lord of the Rings: The Two Towers         138        8.7
## 6:                                      Interstellar         142        8.6

Awards and Nominations Pt1

There are 34 movies with 75+ awards or 120+ nominations.

##                                                Title Awards Nominations
## 1:                                   The Dark Knight    153         153
## 2:                                  Schindler's List     78          33
## 3:     The Lord of the Rings: The Return of the King    208         122
## 4: The Lord of the Rings: The Fellowship of the Ring    117         124
## 5:                                         Inception    154         203
## 6:             The Lord of the Rings: The Two Towers    120         138
##    imdbRating
## 1:        9.0
## 2:        8.9
## 3:        8.9
## 4:        8.8
## 5:        8.8
## 6:        8.7

Awards and Nominations Pt2

There are 19 movies with 75+ awards and 120+ nominations.

##                                                Title Awards Nominations
## 1:                                   The Dark Knight    153         153
## 2:     The Lord of the Rings: The Return of the King    208         122
## 3: The Lord of the Rings: The Fellowship of the Ring    117         124
## 4:                                         Inception    154         203
## 5:             The Lord of the Rings: The Two Towers    120         138
## 6:                                      The Departed     96         134
##    imdbRating
## 1:        9.0
## 2:        8.9
## 3:        8.8
## 4:        8.8
## 5:        8.7
## 6:        8.5

Awards and Nominations Pt3

There are 9 movies with 75+ awards and fewer than 120 nominations.

##                  Title Awards Nominations imdbRating
## 1:    Schindler's List     78          33        8.9
## 2: Saving Private Ryan     79          74        8.6
## 3:              WALL·E     91          90        8.4
## 4:     American Beauty    108          98        8.4
## 5:   L.A. Confidential     87          77        8.3
## 6:                  Up     76          82        8.3

Awards and Nominations Pt4

There are 6 movies with fewer than 75 awards and 120+ nominations.

##                      Title Awards Nominations imdbRating
## 1:            Interstellar     42         142        8.6
## 2:        Django Unchained     58         151        8.4
## 3: The Wolf of Wall Street     38         170        8.2
## 4:               Gone Girl     64         177        8.1
## 5:      The Imitation Game     45         150        8.1
## 6:             The Martian     34         187        8.0
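
The four groups above can be sketched with logical subsetting, and the reported counts can be cross-checked with simple arithmetic. The three rows below are copied from the tables above; the variable names are hypothetical.

```r
# Rows copied from the tables above.
movies <- data.frame(
  Title       = c("The Dark Knight", "Schindler's List", "Interstellar"),
  Awards      = c(153, 78, 42),
  Nominations = c(153, 33, 142)
)

many_awards <- movies$Awards >= 75
many_noms   <- movies$Nominations >= 120

both        <- movies$Title[many_awards & many_noms]
awards_only <- movies$Title[many_awards & !many_noms]
noms_only   <- movies$Title[!many_awards & many_noms]

# Consistency check on the counts reported above.
n_awards <- 28  # movies with 75+ awards
n_noms   <- 25  # movies with 120+ nominations
n_both   <- 19  # movies meeting both thresholds

n_either <- n_awards + n_noms - n_both                        # inclusion-exclusion
n_split  <- n_both + (n_awards - n_both) + (n_noms - n_both)  # 19 + 9 + 6
```

Both `n_either` and `n_split` come to 34, which agrees with the count reported for movies with 75+ awards or 120+ nominations.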

Genre

There are 24 unique genres.

Language Pt1

There are 44 unique languages.

##             x freq
## 3      Arabic    8
## 5   Cantonese    4
## 8     English  250
## 10     French   41
## 11     German   32
## 18    Italian   17
## 19   Japanese    6
## 21      Latin   13
## 31    Russian   12
## 35    Spanish   31
## 40 Vietnamese    4
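
Because each film can list several languages, the counts have to be taken across all of the split language columns. A minimal sketch, assuming the split columns are named l1, l2, and so on as in the completeness output, with hypothetical toy values:

```r
# Hypothetical toy data with two split language columns.
langs <- data.frame(
  l1 = c("English", "English", "Italian"),
  l2 = c("Italian", NA, "English"),
  stringsAsFactors = FALSE
)

# Stack all language columns into one vector; table() drops NA by default.
lang_freq   <- table(unlist(langs))
n_languages <- length(lang_freq)
```

The same approach would work for the split country columns.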

Language Pt2

Country Pt1

There are 31 unique countries.

##            x freq
## 1  Australia    6
## 4     Canada    6
## 7     France   12
## 8    Germany   11
## 12   Ireland    4
## 13     Italy    4
## 27        UK   55
## 29       USA  233

Country Pt2

Number of Awards, BoxOffice, and imdbRatings Pt1

Number of Awards, BoxOffice, and imdbRatings Pt2

Nominations and imdbRatings

Nominations and Year

Nominations and imdbRatings Pt3

IMDB Ratings and Production Company

Awards and Production

Correlation Pt 1

##                    Year     Runtime   Metascore imdbRating   imdbVotes
## Year         1.00000000  0.17938049 -0.34085225 0.04549597  0.53623097
## Runtime      0.17938049  1.00000000 -0.06702619 0.24776081  0.24974676
## Metascore   -0.34085225 -0.06702619  1.00000000 0.17211994 -0.09800265
## imdbRating   0.04549597  0.24776081  0.17211994 1.00000000  0.65668014
## imdbVotes    0.53623097  0.24974676 -0.09800265 0.65668014  1.00000000
## BoxOffice    0.33354876  0.16853031  0.07890235 0.12349854  0.38711525
## Awards       0.47198940  0.19118929  0.30717283 0.19979238  0.44580066
## Nominations  0.61632678  0.18631483  0.11100696 0.08192601  0.47425298
##              BoxOffice    Awards Nominations
## Year        0.33354876 0.4719894  0.61632678
## Runtime     0.16853031 0.1911893  0.18631483
## Metascore   0.07890235 0.3071728  0.11100696
## imdbRating  0.12349854 0.1997924  0.08192601
## imdbVotes   0.38711525 0.4458007  0.47425298
## BoxOffice   1.00000000 0.1831188  0.20620724
## Awards      0.18311879 1.0000000  0.84812343
## Nominations 0.20620724 0.8481234  1.00000000
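
A correlation matrix of this kind can be computed with `cor()`. Since Metascore and BoxOffice contain missing values, some form of NA handling is needed. The sketch below assumes pairwise-complete observations, which the source does not confirm, and uses hypothetical toy data.

```r
# Hypothetical toy data; NA marks a missing Metascore.
df <- data.frame(
  imdbRating = c(9.0, 8.9, 8.8, 8.7, 8.6),
  imdbVotes  = c(1500, 900, 1200, 800, 700),
  Metascore  = c(82, NA, 74, 87, NA)
)

# For each pair of variables, use every row where both values are present.
corr <- cor(df, use = "pairwise.complete.obs")
```

With pairwise handling, each coefficient may be based on a different number of films, so coefficients involving sparsely filled columns such as BoxOffice rest on far fewer observations than the others.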

Correlation Pt 2

The correlation plots show that Year is moderately correlated with several other variables: imdbVotes (0.54), Awards (0.47), and Nominations (0.62). By contrast, imdbRating is only weakly correlated with every variable except imdbVotes (0.66). This is interesting because it suggests that imdbRating is not strongly related to factors we would expect to affect ratings (i.e., Metascore, Awards, and Nominations). Nevertheless, we chose imdbRating as our variable of interest, since the question we want to ask is what makes a movie great, and ratings are a natural metric for a good movie.

ImdbRating and MetaScore

ImdbRating and Year

ImdbRating and Runtime

ImdbRating and ImdbVotes

ImdbRating and BoxOffice

ImdbRating and Awards

ImdbRating and Nominations

Here we graphed imdbRating against each of the other numeric variables. The points are scattered with no clear pattern, except against imdbVotes. However, this could partly be an effect of missing data: Metascore is only 70.8% complete and BoxOffice only 30% complete. In addition, the Awards and Nominations values parsed from the Awards text may be missing or incorrect, since some movies are recent and their awards and nominations have not been updated.

ImdbRatings and Genre

Contributions

Melvin’s Contribution:

  • Melvin considered which variables to analyze (IMDb and Metascore ratings with Awards and BoxOffice) and how he would investigate the trends between those variables.
  • Melvin created an interactive scatter plot using ggplot and plotly. His code provides information about each film (represented as a point in the scatter plot) through the label and shape features of ggplot.
  • He also took into account the story told by the data and the possible conclusions that could be drawn from it.
  • The first scatter plot investigates the relationship between Awards, BoxOffice, and Metascore, and compares averages of Awards counts and BoxOffice for each film.
  • The second scatter plot looks more closely at the top left region of the graph, investigating the genre of the films in that area.

Raymond’s Contribution:

  • Raymond explored the relationship between imdbRating and Nominations.
  • For the first graph, Raymond used ggplot with x = Nominations and y = imdbRating.
  • Raymond then created another ggplot with x = Year and y = Nominations.
  • Finally, Raymond explored the relationship between imdbRating, imdbVotes, and Nominations via an interactive 3D plotly graph, where x = imdbRating, y = imdbVotes, and z = Nominations.
  • Raymond explained the story that a higher imdbRating does not mean that a movie will have more nominations.

Stephanie’s Contribution:

  • Stephanie began by exploring the data set, considering the stories each variable could tell and asking how we could best analyze the data and answer our questions.
  • She applied material that we previously learned in class, using the expression she created for our past assignment: 1-colSums(is.na(df)/lengths(df)). This expression measures the completeness and consistency of the data. For every column (variable), it gives the ratio of NA (missing) values to total values, subtracted from 1; this works for both completeness and consistency.
  • Stephanie chose to investigate IMDb ratings by production company, so she constructed a bar plot with y = Production and x = imdbRating.
  • Stephanie also considered the awards given to the different production companies and how this was connected to imdbRating, so she created a bar plot that shows the number of awards per production company.

Paul’s Contribution:

  • Data Munging: the entire data cleaning and munging process
  • EDA: all of the single-variable analysis
  • EDA: correlation plot across all numeric variables
  • EDA: analysis of imdbRating against the other numeric variables
  • EDA: distribution of imdbRating by genre
  • PowerPoint: creating the slides and merging them into a single coherent PowerPoint
  • Report: adapting the PowerPoint into report form